Abstract

One of the significant advantages in problems with perfect information, like
search or games like checkers, is that they can be decomposed into independent
pieces. In contrast, problems with imperfect information, like market modeling or
games like poker, are treated as a single non-decomposable whole. Handling the game
as a single unit places a much stricter limit on the size of solvable imperfect
information games. This paper has two main contributions. First, we introduce CFR-D,
a new variant of the counterfactual regret minimising family of algorithms. For
any problem which can be decomposed into a trunk and subproblems, CFR-D can
handle the trunk and each subproblem independently. Decomposition lets CFR-D
have memory requirements which are sub-linear in the number of decision points,
a desirable property more commonly associated with perfect information algorithms.
Second, we present an algorithm for recovering an equilibrium strategy in
a subproblem given the trunk strategy and some summary information about the
subproblem.



1 Introduction 

Perfect information games like checkers, where game states are entirely public, have
proven more tractable than games of imperfect information like poker, where some
information about the game state is hidden from one or more players. The primary reason
is that perfect information games can easily be split into subgames. Any time a player
is about to act, the subgames following each possible action can be reasoned about
independently from the other actions, from how the state in the game was reached, and
from other unreachable states of the game. Reasoning about subgames independently
allows for time and memory efficient algorithms like depth-first iterative-deepening [7].

The corresponding application of decomposition to imperfect information games
does not work. Consider the games in Figure 1. In the left game, the first player gets
to choose whether to play tic-tac-toe or checkers. Within this game, the tic-tac-toe and
checkers subproblems can be analysed independently, and the correct choice at the root
of the game can be made afterwards on the basis of the values of the subproblems.

Figure 1: Subproblem Example (panels: Game A and Game B)
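
As a minimal sketch of the decomposition available in the left game of Figure 1, the following Python treats each subgame as a nested list of leaf payoffs and solves it with plain minimax. The nested-list encoding, the payoffs, and the function name `solve` are illustrative assumptions, not part of the paper. The point is only that each subtree is evaluated without reference to its siblings or to how it was reached.

```python
def solve(node, maximizing=True):
    """Minimax value of a perfect information game tree.

    A node is either a number (leaf payoff to the maximizing player) or a
    list of child nodes. Players alternate at each level, an assumption
    made to keep the sketch short.
    """
    if isinstance(node, (int, float)):
        return node
    # Each child subtree is solved on its own; no information about
    # sibling subtrees or the path to this node is needed.
    child_values = [solve(child, not maximizing) for child in node]
    return max(child_values) if maximizing else min(child_values)


# Hypothetical stand-ins for the two subgames reachable from the root.
tic_tac_toe = [0, [1, -1]]
checkers = [[2, 3], 1]

# Solve the subgames independently, then choose at the root afterwards.
subgame_values = [solve(g, maximizing=False) for g in (tic_tac_toe, checkers)]
root_value = max(subgame_values)
```

Solving the subgames first and taking the maximum at the root gives the same value as solving the whole tree at once, which is exactly the property that fails once hidden information is added.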

In the right game, there is a chance event, which only the first player ever gets 
to see. Additionally, some of the outcomes are modified based on the outcome of 
the first chance event, producing new subproblems like checkers' and checkers". The 
second player does not know which variant they are playing. The correct strategy for 
player two in checkers'/checkers" (or, similarly, tic-tac-toe) depends on the likelihood 
that player one chooses to play checkers' and checkers". This choice, however, will 
depend on the value of the subproblems, which depends on the player two strategy in 
these subproblems. By adding hidden information which affects the game outcome, 
the subproblems can no longer be analysed independently in the same way. Any fixed 
assumptions about the initial player one choice — even if the initial choice is correct 
for some Nash equilibrium — can suggest a player two strategy that is exploitable by 
a different initial choice for player one. 

Even though decomposing an imperfect information game has previously had no
formal guarantees, the memory savings are large enough that decomposition has still
been employed in coping with human-scale domains, such as poker. Both PS-Opti [1]
and GS1 [4] were strong poker AIs for their time, and both chose to split the game in
an unsafe fashion. We present, for the first time, a general method for generating an
error-bounded approximation of a Nash equilibrium through decomposing and
independently analyzing subproblems of an imperfect information game. We compare this
to the prior method, showing that the lack of a theoretical bound leads to significant error
in practice.

One of the tradeoffs of using decomposition to solve a problem is that only part 
of the strategy might be kept. If we save space by not storing part of the solution, the 
missing portions must be regenerated as needed. The second contribution of this paper
is an error-bounded method of recovering a strategy in a subproblem, using only some 
information about the root of the subproblem. 

2 Background 

An extensive-form game is an explicit description of the possible interaction of one or
more agents in some problem domain. There is a set of players $P$, which for the sake of
convenience, we will consider to include a chance agent $p_c$ which controls stochastic
events. $H$ is a set of all possible game states, represented by the history of actions
taken from the initial game state $\emptyset$. For any history $h \in H$, $P(h) \mapsto P \cup \{\text{leaf}\}$ gives
the player that is about to act or leaf if the game is over, and $A(h)$ gives the set of valid
actions. $H_p$ is the set of all states $h$ such that $P(h) = p$. The state $h \cdot a$ is said to be a
child of the state $h$, $h$ is the parent of $h \cdot a$, and we will say $h_i$ is an ancestor of $h_j$, or
$h_i \sqsubset h_j$, if $h_i$ is the parent of $h_j$ or $h_i \sqsubset h_k$ and $h_k$ is the parent of $h_j$. $h \sqsubseteq j$ if $h \sqsubset j$
or $h = j$. We will let $Z$ be the set of all leaf states. For every $z \in Z$, $u_p(z) \mapsto \mathbb{R}$ gives
the payoff for player $p$ if the game ends in state $z$.

For each player $p$, hidden information is described by information sets, which are
a partition $\mathcal{I}_p$ of $H_p$. For any information set $I \in \mathcal{I}_p$, any two states $h, j \in I$ are
indistinguishable to player $p$. A behaviour strategy $\sigma_p \in \Sigma_p$ is a function $\sigma_p(I, a) \mapsto \mathbb{R}$
which assigns a probability distribution over valid actions to every information set
$I \in \mathcal{I}_p$. We will say $\sigma(h, a) = \sigma(I(h), a)$, where $I(h)$ is the information set which contains
$h$. $Z(I) = \{z \text{ s.t. } z \in Z, z \sqsupseteq h, h \in I\}$ is the set of all terminal states $z$ reachable from
some state in information set $I$. We could also consider the leaves reachable from $I$
after some action $a$, stated as $Z(I, a) = \{z \text{ s.t. } z \in Z, z \sqsupseteq h \cdot a, h \in I\}$. Conversely,
$h[S]$ is the longest history $j$ in some set of states $S$ for which $j \sqsubseteq h$.

In games with perfect recall, any two histories $h, j$ in an information set $I \in \mathcal{I}_p$ passed
through the same sequence of player $p$ information sets, and made the same action at
those information sets. Informally speaking, this means that a player does not forget
their own actions, or any information about chance or opponent actions that they have
observed. Having a unique history of information sets also lets us say an information
set $J$ is a child of information set $I$ if for any $h \in J$, $h[I]$ is the longest strict
subsequence of $h$ where $P(h[I]) = P(h)$.

A strategy profile $\sigma \in \Sigma$ is a tuple of strategies, one for each player. Given a
strategy profile $\sigma$ we can construct a new profile $\sigma_{\langle \sigma'_p \rangle}$ which is identical except that
player $p$'s strategy has been replaced by $\sigma'_p$. It is also useful to refer to certain products
of probabilities. For any $h \in H$ and $\sigma \in \Sigma$, $\pi^\sigma(h) = \prod_{j \cdot a \sqsubseteq h} \sigma_{P(j)}(j, a)$ gives the
joint probability of reaching state $h$ if all players follow $\sigma$. We also use $\pi_p^\sigma(h)$ to refer
to the product of only the terms where player $p$ acts, and $\pi_{-p}^\sigma(h)$ to refer to the product
of terms where any player but $p$ acts. We use $\pi^\sigma(j, h)$ to refer to the product of terms
from $j$ to $h$, rather than from $\emptyset$ to $h$. Finally, it is useful to speak of replacing portions
of a strategy with another strategy: $\sigma[S \leftarrow \sigma']$ is the strategy that is equal to $\sigma$ everywhere
except at information sets in $S$, where it is equal to $\sigma'$.

Given a strategy profile $\sigma$, the expected utility $u_p^\sigma$ to player $p$ if all players follow
$\sigma$ is $\sum_{z \in Z} \pi^\sigma(z) u_p(z)$. A best response strategy $BR_p(\sigma) = \arg\max_{\sigma'_p \in \Sigma_p} u_p^{\sigma[\mathcal{I}_p \leftarrow \sigma'_p]}$ is
a strategy for $p$ which maximises $p$'s value if all other player strategies remain fixed.
A Nash equilibrium is a strategy profile where all strategies are best responses, and an
$\epsilon$-Nash equilibrium approximation is a profile where the expected value for each player
is within $\epsilon$ of the value of a best response strategy.
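
To make the definitions of expected utility, best response, and approximate equilibrium concrete, here is a small sketch in Python. It swaps the extensive-form game for a two-player zero-sum matrix game, which is a simplifying assumption; the function names and the matching-pennies payoffs are hypothetical and chosen only for illustration.

```python
def expected_utility(payoff, x, y):
    """Expected payoff to player one when player one mixes with x and player two with y."""
    return sum(x[i] * y[j] * payoff[i][j]
               for i in range(len(x)) for j in range(len(y)))


def best_response_values(payoff, x, y):
    """Value each player obtains by best responding to the other's fixed strategy."""
    br1 = max(sum(y[j] * payoff[i][j] for j in range(len(y)))
              for i in range(len(x)))
    # Zero-sum: player two's payoff is the negation of player one's.
    br2 = max(-sum(x[i] * payoff[i][j] for i in range(len(x)))
              for j in range(len(y)))
    return br1, br2


def nash_gap(payoff, x, y):
    """Largest gain available to either player by deviating.

    The profile (x, y) is an epsilon-Nash approximation for any
    epsilon at least as large as this gap.
    """
    u1 = expected_utility(payoff, x, y)
    br1, br2 = best_response_values(payoff, x, y)
    return max(br1 - u1, br2 - (-u1))


# Rows are player one actions, columns are player two actions.
MATCHING_PENNIES = [[1, -1], [-1, 1]]
```

For example, `nash_gap(MATCHING_PENNIES, [0.5, 0.5], [0.5, 0.5])` is zero because uniform play is an equilibrium, while a player one strategy that always picks the first row leaves a gap of one for player two to exploit.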

All of the work in this paper assumes two player, zero-sum, perfect recall games.
That is, $P = \{p_1, p_2, p_c\}$, for all $z \in Z$, $u_{p_1}(z) + u_{p_2}(z) = 0$, and for any information
sets $I_1 \neq I_2 \in \mathcal{I}_p$, if $h \in I_1, j \in I_2$ then $I(h') \neq I(j')$ for all $h' \sqsupseteq h$, $j' \sqsupseteq j$.
2.1 Existing CFR variants

The Counterfactual Regret minimization (CFR) algorithm [10] is an efficient method
for finding an approximation of a Nash equilibrium in very large games. CFR is an
iterated self play algorithm, where the average policy across all iterations approaches
a Nash equilibrium. It has independent regret minimisation problems being
simultaneously updated for every information set, at each iteration. Each minimisation
problem at an information set $I \in \mathcal{I}_p$ uses a utility function called counterfactual
value: $v_p^\sigma(I, a) = \sum_{z \in Z(I, a)} \pi_{-p}^\sigma(z) \pi_p^\sigma(z[I] \cdot a, z) u_p(z)$. Informally, the counterfactual
value of $I$ for player $p$ is the expected value of $I$ for $p$ of the policy, if $p$ had instead
always played to reach $I$. The regret for a series of profiles, called immediate
counterfactual regret, is computed as $\max_{a \in A(I)} \sum_t \left( v_p^{\sigma^t}(I, a) - \sum_{a'} \sigma^t(I, a') v_p^{\sigma^t}(I, a') \right)$,
where $p = P(I)$. Note that because $\pi_p^\sigma(h) = \pi_p^\sigma(j)$ for all $h, j \in I$, $u_p^\sigma(I, a) =
\sum_{z \in Z(I, a)} \pi^\sigma(z) u_p(z) = \pi_p^\sigma(I) v_p^\sigma(I, a)$.
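
The per-information-set update can be sketched in a few lines of Python. This is a minimal illustration of regret matching driven by counterfactual action values at a single information set, not the authors' implementation. The function names are hypothetical, and the uniform fallback when no action has positive regret is a common convention assumed here rather than something stated in the text.

```python
def regret_matching(cumulative_regret):
    """Current strategy: play each action in proportion to its positive regret."""
    positive = [max(0.0, r) for r in cumulative_regret]
    total = sum(positive)
    if total > 0.0:
        return [p / total for p in positive]
    # No action has positive regret: fall back to uniform play (an assumption).
    n = len(cumulative_regret)
    return [1.0 / n] * n


def accumulate_regret(cumulative_regret, action_values, strategy):
    """Add this iteration's regret: each action's counterfactual value minus
    the counterfactual value of the current mixed strategy."""
    node_value = sum(s * v for s, v in zip(strategy, action_values))
    return [r + v - node_value
            for r, v in zip(cumulative_regret, action_values)]
```

Calling `accumulate_regret` once per iteration and `regret_matching` to obtain the next strategy reproduces the inner loop that every information set runs independently.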

The desired end result, minimising regret across the space of entire strategies, is
an emergent property. Bounding all of the immediate counterfactual regrets in a tree
can be shown to place an upper bound on the regret over all possible strategies in that
tree (called full counterfactual regret). It then immediately follows that minimising
immediate counterfactual regrets at all information sets will minimise the regret across
all strategies, and another short argument shows that an $\epsilon$-regret strategy profile is a
$2\epsilon$-Nash equilibrium. These proofs of convergence to a Nash equilibrium are given by
Zinkevich et al. in the original CFR paper [10].

Using separate regret minimisation problems at each information set makes CFR a 
very flexible framework for solving games. First, the values for actions are all that is 
used at any single regret minimisation problem at some information set $I$. The action
probabilities of the strategy profile outside $I$ are otherwise irrelevant. Second, while
the strategy profile outside $I$ is generated by the other minimisation problems in CFR,
the source does not matter. Any sequence of strategy profiles will do, as long as they 
have low regret. 

As a result, a number of CFR variants have been proposed. Instead of minimising
regrets at every information set, they could be sampled, leading to the MCCFR [8]
family of algorithms. Instead of doing self play with CFR generating the strategy for
both players, CFR-BR [5] uses a best response strategy for one player at each step,
which guarantees that the regrets for that player are non-positive. All of these existing
variants, including CFR-BR, store strategy probabilities and counterfactual regrets at
every information set for at least one player. Unlike our proposed algorithm, each of these
existing variants has memory usage that is linear in the number of decisions for the
player.

3 Decomposition into Subproblems 

The CFR-BR algorithm provided the inspiration for the method we propose in this
paper for handling games through decomposition. In contrast to previous CFR variants,
which use regret minimisation (and storage) to find a strategy for both players, CFR-BR
uses a best response for one player. The method splits the game up into a trunk (a
portion of the game tree rooted at the start of the game) and a number of independent
subproblems. The only necessary condition for the subproblems is that the boundaries
of the subproblem do not cross information sets: for any state $h$ in the subproblem, for
all $j \in I(h)$, $j$ is also in the subproblem.

In the trunk, CFR-BR uses regret minimisation to find the policy for both players. 
Within a subproblem, regret minimisation is used only for the player of interest, with 
the opponent using a best response. At each iteration, after computing a best response 
and updating regrets for the player of interest, the best response strategy is discarded. 

As Johanson et al. [5] show, a best response strategy has non-positive cumulative
regret after any number of iterations. Having non-positive regret in the subproblems 
means the regret of the average strategy only depends on the regret within the trunk, 
and regret in the trunk is guaranteed to be minimised by running CFR. The resulting 
average strategy is a Nash equilibrium approximation. 

With CFR-D, we propose using a best response strategy for both players in
subproblems: a Nash equilibrium within the subproblem. Like CFR-BR, we discard this
strategy at each iteration. This immediately provides the sub-linear memory
properties, as we are finding an approximation of a Nash equilibrium, but are only storing the
strategy within the trunk. Correctness immediately follows from the correctness proof
in the CFR-BR paper: there are simply more information sets which are guaranteed to
have non-positive regret. We discuss the method in detail in Section 4.

For our purposes, a more significant property is that there is no need to store
anything about the subproblem for the best response player. All that is needed is the
probabilities of reaching possible states at the root of the subproblem, which only
depend on the current strategy profile in the trunk. CFR-BR only stores information
about the CFR player within subproblems. By having both players use best responses,
our proposed method does not need to store any information about the subproblems.
Eliminating information within the subproblems reduces the memory requirement to
be linear in the number of information sets in the trunk, plus whatever amount of
memory is needed to solve one single subproblem. Depending on the sizes of the trunk and
subproblems, and the number of subproblems, treating the subproblems independently
could lead to a substantial reduction in space. For example, solving the game of two
player limit Texas Hold'em poker with other CFR variants would require on the order
of 100TB, even though CFR is a memory efficient algorithm and only uses memory on
the same order as the size of the final strategy description. In contrast, solving the game
by splitting the game at the second round would only require on the order of 1GB [5].
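
The memory argument above amounts to replacing a sum over subproblems with a maximum. The sketch below states that arithmetic in Python under the simplifying assumption that memory is proportional to the number of information sets; the function names and the example sizes are hypothetical and are not the poker figures quoted in the text.

```python
def monolithic_memory(trunk_size, subproblem_sizes):
    """Information sets held at once when the whole game is stored."""
    return trunk_size + sum(subproblem_sizes)


def decomposed_memory(trunk_size, subproblem_sizes):
    """Information sets held at once when only the trunk is stored and
    subproblems are solved one at a time and then discarded."""
    return trunk_size + max(subproblem_sizes, default=0)


# Hypothetical sizes, in information sets.
example_trunk = 10
example_subproblems = [100, 200, 50]
```

With these example sizes the monolithic requirement is 360 information sets and the decomposed requirement is 210. The gap widens as the number of similarly sized subproblems grows.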

The cost of discarding subproblem strategies at each iteration is that the output 
of the algorithm does not include a policy for acting within the subproblems. We will 
only have the probabilities for the strategy within the trunk. To get the strategy within a 
subproblem, we must recover it by solving a new problem. The second contribution of 
this paper is an error-bounded method of recovering subproblem strategies. While this 
is needed for CFR-D, this method is not strictly tied to CFR-D, and could be applied 
in other situations. 

Unlike the subproblem solutions needed to solve the trunk, recovering an equilibrium
strategy in a subproblem is not simply a matter of finding a subproblem equilibrium
strategy given the current trunk policy. For example, consider the case where the
$p_2$ strategy in the trunk dictates that $p_2$ never reaches the subproblem. Because all leaf
values will be 0, any strategy for $p_1$ in the subproblem will be part of an equilibrium
profile, even with an extra condition ensuring that it is a counterfactual best response.
$p_1$ is free to choose a strategy where the counterfactual value for $p_2$ might be very high.
If the $p_1$ strategy is highly exploitable, $p_2$ might change their trunk strategy so that it
plays into the subproblem, taking advantage of the poor $p_1$ strategy in that subproblem.

Using the counterfactual values observed in the trunk at the root of the subproblems,
we can build a modified subproblem. To find the strategy for a player (for example,
$p_1$) we add a binary decision for every opponent ($p_2$ if we're finding a $p_1$ strategy)
information set at the root of the subproblem. One choice immediately ends the game,
with a value based on the counterfactual value observed while solving the trunk. The
other choice leads to the subproblem. This arrangement ensures that $p_1$ finds a strategy
which minimises $p_2$ best response values, without expending too much effort on the
cases where $p_2$ would simply choose not to play. We discuss the process of recovering
subproblem strategies in detail in Section 5.

4 Generating the Trunk Strategy using CFR-D 

CFR-D finds an approximation of a Nash equilibrium by generating a pair of low- 
regret strategies. To save space, the game is split up into a trunk and a number of 
subproblems, and the action probabilities are only saved within the trunk. All of the 
subproblems must be independent, so that for any state $h$ in the subproblem, for all
$j \in I(h)$, $j$ is also in the subproblem.

We also need to augment the set of information sets in the trunk, at the root of each 
subproblem. For all players p, we must partition the states at the root of a subproblem 
according to the information set $I$ which is the last information set at which player $p$
acted. Note that this is a slight extension of the usual definition of an information set, 
which is only defined for the player which is acting. These special information sets 
partition the states at the root of the game for both players. 

As part of the solution process, we will need to be able to solve a subproblem given
a trunk strategy $\sigma$. With a fixed policy $\sigma$ in the trunk, solving a subproblem means
finding a strategy profile $\hat\sigma$ for the subproblem such that for either player $p$, $\sigma_p[SG \leftarrow \hat\sigma_p]$
is a best response to $\sigma_{-p}[SG \leftarrow \hat\sigma_{-p}]$ within the restricted space of strategies that play
like $\sigma$ outside of the subproblem. The solution $\hat\sigma$ must also satisfy one additional
constraint to satisfy the counterfactual nature of CFR. If the probability of a player
reaching an information set $I$ is 0, that is $\pi_p^\sigma(I) = 0$, the strategy after $I$ must still be
a best response even if the trunk strategy was changed so that $\pi_p^\sigma(I)$ was not 0 (i.e.,
a counterfactual consideration). Without this constraint, $\hat\sigma$ will not necessarily have
non-positive counterfactual regret in the subproblem.

To see why there can be positive counterfactual regret, consider the case when
$\pi_p^\sigma(I) = 0$ for some $I$ at the root of a subproblem. For any information set $J \in \mathcal{I}_p$
which is a descendant of $I$, $u_p^{\sigma[SG \leftarrow \hat\sigma]}(J, a) = 0$ regardless of the policy $\hat\sigma(J)$ because
$\pi_p^\sigma(J)$ is part of each term in that sum. At the same time, the counterfactual value
$v_p^{\sigma[SG \leftarrow \hat\sigma]}(J, a)$ is not multiplied by $\pi_p^\sigma(I)$ and can have a non-zero value. If every
possible policy under $I$ is a part of a Nash equilibrium because the values are uniformly
0, an arbitrarily chosen Nash equilibrium is unlikely to achieve the best counterfactual
value, which may be non-zero.

There are at least three possible strategies for finding an equilibrium which satisfies 
the counterfactual value constraint. First, we could generate an arbitrary Nash
equilibrium, and then fix it with a post-processing step which computes the best response
using counterfactual values whenever the reach probability is 0. Second, we could try
directly adding the constraint to some other solution method, like a sequence form
linear program [6] or iterated smoothing 0. Finally, we could simply use some CFR
variant to solve the subproblem, as they naturally produce strategies with the desired 
property. Note that because CFR-D is a solution method, we could use CFR-D itself as 
a solution method for the subproblems. With this recursive decomposition, and a game 
with sufficient structure, the memory requirements to find the top-level trunk strategy 
would be linear in the depth of the game. 

CFR-D, our proposed method for solving the trunk of a game, is an iterative
algorithm. At each step, there are three stages. First, the current trunk strategy is computed
from the regrets, and the average trunk strategy is updated. Next, subproblems are
examined one at a time. Each subproblem is solved, and using this solution,
counterfactual values are computed and recorded for the special information sets at the root
of the subproblem, for both players. The subproblem solution is then discarded. In
the final stage of a CFR-D iteration, counterfactual values are propagated up the trunk
from the subproblems and from any terminal states which are in the trunk, updating
regrets as we go. Solving a game using CFR-D is described in Algorithm 1.

If the counterfactual regret at an information set $I$ at the root of a subproblem is
bounded by $\epsilon_S$ at each time step, then at time $T$ the accumulated full counterfactual
regret $R^T_{full}(I) \le T\epsilon_S$. Following Zinkevich et al.'s argument in Appendix A.1 [10],
the average regret over the whole game will be bounded by $N_{TR}\sqrt{AT}/T + N_S\epsilon_S$,
where $N_{TR}$ is the number of information sets in the trunk, $N_S$ is the number of
information sets which are at the root of a subproblem, and $A$ is the maximum number
of available actions at an information set in the trunk.

For simplicity, CFR-D was described using the original CFR algorithm of Zinkevich
et al. in the trunk. There is nothing that precludes using sampling variants like
MCCFR instead. Some variants of MCCFR, like external sampling [8], are often faster
than CFR, and this may also be the case in CFR-D for some games.

5 Recovering a Subproblem Strategy 

We now present a method of recovering a strategy in a subproblem, given some
information about the root of the subproblem. Without this, CFR-D is largely useless:
unless we are only interested in the game value, at some point we will presumably want 
to know the action probabilities at some information set in a subproblem, but they have 
been discarded. This problem might also arise in other situations. For example, we 
might wish to move a Nash equilibrium strategy from some large machine to one with 
very limited memory. If we can recover the strategy in a subproblem, we can throw 
away parts of the original strategy until the remaining portion is small enough. 

ALGORITHM 1: CFR-D

Input: Number of time steps $T$, and an extensive-form game partitioned into a trunk and
       subproblems
Output: action probabilities $probs_{I,a}$ for information sets in the trunk, and counterfactual
       values $cfv_{I_{S,p}}$ for both players at all special information sets at the root of a
       subproblem

$r = 0$; $probs = 0$; $cfv = 0$;
for each time step $t$ from 1 to $T$ do
    for each information set $I$ in the trunk and action $a$ do
        $\sigma_{I,a} = \max(0, r_{I,a}) / \sum_{a'} \max(0, r_{I,a'})$;
        $probs_{I,a} = probs_{I,a} + \sigma_{I,a} \cdot \pi^\sigma_{P(I)}(I) / T$;
    end
    for each subproblem $S$ do
        $\hat\sigma = \mathrm{SOLVE}(S)$;
        for each player $p$ and special information set $I_{S,p}$ at root of $S$ do
            $v_{I_{S,p}} = \sum_{z \in Z(I_{S,p})} \pi_{-p}^{\sigma[S \leftarrow \hat\sigma]}(z) \pi_p^{\sigma[S \leftarrow \hat\sigma]}(z[I_{S,p}], z) u_p(z)$;
            $cfv_{I_{S,p}} = cfv_{I_{S,p}} + v_{I_{S,p}} / T$;
        end
    end
    for information set $I$ in the trunk, visited in post-order depth first order do
        $childval = 0$;
        for each action $a$ do
            for each child $J$ of $I$ consistent with $a$ do
                $childval_a = childval_a + \sigma(I, a) v_J$;
            end
        end
        $v_I = childval \cdot probs_I$;
        for each action $a$ do
            $r_{I,a} = r_{I,a} + childval_a - v_I$;
        end
    end
end

Recovering a subproblem strategy requires a modified game. To see why, consider
what happens if we directly use the trunk strategy probabilities to re-solve the
subproblem, as done by PS-Opti [1] and GS1 [4]. A simple counterexample can be found
in the case where the trunk strategy for $p_1$ never reaches the subproblem, so that
$\pi_1^\sigma(I) = 0$ for all $p_1$ information sets $I$ at the root of the subproblem. This implies
$\pi^{\sigma[SG \leftarrow \hat\sigma]}(z) = 0$ for all $z \in SG$ and subproblem strategy $\hat\sigma$. Any $p_2$ subproblem
strategy is then part of a Nash equilibrium of the subproblem. For many of these
strategies, however, the counterfactual values for $p_1$ (the utility they would get if they had
played into the subproblem) may be higher than the value $p_1$ gets elsewhere in the
game. If the counterfactual value of the subproblem is higher than the counterfactual
value elsewhere, $p_1$ has incentive to alter their play in the trunk. If $p_1$ can achieve
a higher value by changing their policy, the combination of the original $p_2$ strategy
and the new $p_1$ subproblem strategy cannot be part of the original Nash equilibrium
strategy in the full game.
Figure 2: Construction of the Recovery Game from a Subproblem (original subproblem
leaf utility: $u(z)$; recovery game leaf utility: $\tilde{u}(\tilde{z}) = k\,u(z)$)

For a more concrete example, consider the game in the right side of Figure 1,
introduced in Section 1. Let the trunk be the player one choices of tic-tac-toe'/checkers' and
tic-tac-toe"/checkers", and let the two subproblems be tic-tac-toe' plus tic-tac-toe", and
checkers' plus checkers". Let us also assume that the subproblems are set up so that in
every Nash equilibrium, player one always chooses to play checkers' or checkers". If
the trunk strategy is part of an equilibrium, then the probabilities of player one reaching
tic-tac-toe' and tic-tac-toe" are both 0. Because player one does not reach the
tic-tac-toe games, if we re-solve the tic-tac-toe'/tic-tac-toe" subproblem independently of the
whole game, player two is free to pick any policy $\sigma_{ttt}$, as the player two value (actual
or counterfactual) of every subproblem policy is 0, including policies where player two
would be letting player one win. That is, $\sigma_{ttt}$ is part of a best response to the original
player one equilibrium strategy. The player one equilibrium strategy, however, may no
longer be a best response: playing tic-tac-toe might now be a better choice if player
two lets player one win.

Note that these previous examples are slightly different from the example used
to show that the counterfactual value constraint is needed when solving a
subproblem within an iteration of CFR-D. It is also insufficient to simply handle the special
case where a player never reaches a subproblem: there are more complicated
counterexamples for other cases if we solve the unmodified subproblem using the trunk policy.

The problems that arise with using the subproblem directly suggest that in addition 
to solving the subproblem, we wish to minimise the difference between the opponent's 
counterfactual value for the recovery subproblem strategy, and the counterfactual value 
of the subproblem that is achieved by the equilibrium strategy in the whole game. We 
propose using the game shown in Figure 2. From here on, we will assume, without loss
of generality, that we are recovering a strategy for $p_1$. We will distinguish the recovery
game from the original game by using a tilde to mark states, utilities, or strategies for
the recovery game.

There is an initial chance node which leads to states $\tilde{r} \in \tilde{R}$, corresponding to all
states $r \in R$ at the root of the subproblem in the original game. Each state $\tilde{r} \in \tilde{R}$ occurs
with probability $\pi_{-2}^\sigma(r)/k$, where the constant $k = \sum_{r \in R} \pi_{-2}^\sigma(r)$ is used to ensure
that the probabilities sum to 1. $\tilde{R}$ is partitioned into information sets $\mathcal{I}_2^{\tilde{R}}$ that align
with the last choice $p_2$ made, corresponding to the set of states which can be assigned
different $p_2$ reach probabilities given different trunk strategies. For any $\tilde{r}, \tilde{r}' \in \tilde{R}$,
$I(\tilde{r}) = I(\tilde{r}') \iff \pi_2^\sigma(r) = \pi_2^\sigma(r')$ for all $\sigma$. These are the same special information
sets used in the CFR-D trunk solution algorithm.

At each $\tilde{r} \in \tilde{R}$, $p_2$ has a binary choice of F or T. After T, the game ends. After F,
the game is the same as the original subproblem. All leaf utilities are multiplied by $k$ to
undo the effects of normalising the initial chance event. So, if $\tilde{z}$ corresponds to a leaf
$z$ in the original subproblem, $\tilde{u}_2(\tilde{z}) = k u_2(z)$. If $\tilde{z}$ is a terminal state after a T action,
$\tilde{u}_2(\tilde{z}) = \tilde{u}_2(\tilde{r} \cdot T) = k v_2^\sigma(I(r)) / \sum_{h \in I(r)} \pi_{-2}^\sigma(h)$. This means that for any $I \in \mathcal{I}_2^{\tilde{R}}$,
$\tilde{u}_2(I \cdot T) = v_2^\sigma(I)$.
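
The construction of the recovery game root can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the authors' code: root states and information sets are plain dictionary keys, every root state has positive reach by players other than $p_2$, and the payoff for the terminating action T is taken to be $k\,v_2(I)/\sum_{h \in I} \pi_{-2}(h)$, the scaling under which the chance-weighted value of T at an information set equals $v_2(I)$. All function and parameter names are hypothetical.

```python
def recovery_game_root(reach_without_p2, info_set_of, cfv_p2):
    """Build the root of the recovery game used to recover a p1 strategy.

    reach_without_p2: dict mapping each root state r to pi_{-2}(r) > 0
    info_set_of:      dict mapping each root state r to its p2 information set
    cfv_p2:           dict mapping each p2 information set I to v_2(I)
    """
    k = sum(reach_without_p2.values())  # normalising constant
    chance = {r: reach / k for r, reach in reach_without_p2.items()}

    # Total pi_{-2} reach of each p2 information set at the root.
    set_reach = {}
    for r, reach in reach_without_p2.items():
        info_set = info_set_of[r]
        set_reach[info_set] = set_reach.get(info_set, 0.0) + reach

    # Payoff to p2 for taking the terminating action T at each root state.
    terminate_value = {
        r: k * cfv_p2[info_set_of[r]] / set_reach[info_set_of[r]]
        for r in reach_without_p2
    }
    return k, chance, terminate_value


def recovery_leaf_utility(k, original_utility):
    """Leaf utility after action F: the original payoff scaled by k."""
    return k * original_utility
```

A quick sanity check on any instance is that the chance probabilities sum to one and that, for each information set, summing `chance[r] * terminate_value[r]` over its states returns the stored counterfactual value.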

No further complications are needed, so that if we solve the proposed game, we
can directly use the recovered $p_1$ strategy in the original subproblem. In Section 8 we
give a proof that the exploitability increases by no more than the regret bound on the
original subproblem, plus the regret of the recovery strategy. Theorem 1 implies that if
we recover the strategy for both players at all subproblems, the regret of the complete
recovered strategy is bounded from above by $N_{TR}\sqrt{AT}/T + N_S(3\epsilon_S + 2\epsilon_R)$, where
$\epsilon_R$ is the regret bound in a recovery subproblem.

6 Experimental Results 

We demonstrate CFR-D using Leduc Hold'em poker. This is a small poker variant
which has become a testbed for research on imperfect information games [9, 2]. The
game involves a deck of 6 cards (2 suits and 3 ranks) and two rounds of betting, with 
at most 1 bet and 1 raise per round. Each player starts by paying one chip, with bets 
and raises costing 2 chips in the first round and 4 chips in the second round. The 
game is complicated enough to show many interesting behaviours, but with only 936 
information sets it is small enough that a wide range of experiments can easily be run 
and evaluated. 
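
The game parameters above are small enough to enumerate directly. The sketch below is a hypothetical encoding of them in Python: the rank and suit labels and all names are illustrative, and the single public card dealt before the second round is background knowledge about Leduc Hold'em rather than something stated in this section. It does not attempt to reproduce the count of 936 information sets.

```python
from itertools import permutations

# Rank and suit labels are illustrative; the text only specifies 3 ranks and 2 suits.
RANKS = ("J", "Q", "K")
SUITS = ("s", "h")
DECK = [rank + suit for rank in RANKS for suit in SUITS]

ANTE = 1
BET_SIZE = {1: 2, 2: 4}   # chips per bet or raise in rounds one and two
MAX_BETS_PER_ROUND = 2    # at most one bet and one raise


def deals():
    """All ordered deals (player one card, player two card, public card)."""
    return list(permutations(DECK, 3))


def max_commitment():
    """Most chips one player can put in: the ante plus a matched bet and
    raise in each of the two rounds."""
    return ANTE + sum(MAX_BETS_PER_ROUND * size for size in BET_SIZE.values())
```

Under this encoding there are 120 ordered deals, and a player commits at most 13 chips in a hand.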

While it would definitely be interesting to test CFR-D performance in a much larger
game like limit Texas Hold'em, there is a serious issue with evaluation. Current
solution techniques can already solve very large games, and evaluating the resulting
solutions to find the approximation error can be a computation that requires on the order of
CPU-months, even though the strategy probabilities are simple table lookups. Adding
in the cost for recovering subproblem strategies for a CFR-D solution would make this
evaluation a significant computational undertaking. Moving to an even larger game
which current techniques could not handle, like limit Texas Hold'em, the evaluation is
completely beyond current computational resources.

The trunk used was the first round of betting, with five subproblems corresponding 
to the five different betting sequences in the first round which continue to the second 
round. Our implementation of CFR-D also uses CFR for solving subproblems and 
strategy recovery games. All the reported results use 200,000 iterations for each of 
the recovery subproblems. The exploitability numbers reported are the average of the 
expected utilities of the best response strategies for both players. Each line of Figure 3
plots the exploitability for different numbers of subproblem iterations, ranging from 
100 to 12,800 iterations. There are results for 500, 2,000, 8,000, and 32,000 trunk 
iterations. 
Figure 3: log-log plot of CFR-D exploitability results in Leduc Hold'em



Looking from left to right shows the improvement in solution quality as the quality 
of subproblem solutions improves. The four separate lines show the improvement in 
solution quality for an increasing number of CFR-D iterations. Using 32,000 trunk 
iterations, 12,800 subproblem iterations, and 200,000 recovery game iterations, we can 
drive the exploitability to 7.5 chips per 1,000 hands. 

Given that the error bound for CFR variants is $O(1/\sqrt{T})$, one might expect 
exploitability results to be a straight line on a log-log plot. In these experiments, however, 
CFR is being run on the trunk, subproblems, and the recovery games. From Section 3, 
the error is a sum of three terms: $N_{TR}/\sqrt{AT}$, $3N_S\epsilon_S$, and $2N_S\epsilon_R$. For each of the 
lines on the graph, $N_{TR}/\sqrt{AT}$ and $2N_S\epsilon_R$ are constant non-zero values. Only $\epsilon_S$ 
decreases as the number of subproblem iterations increases, so each line is approaching 
$N_{TR}/\sqrt{AT} + 2N_S\epsilon_R > 0$, which shows up as a plateau on a log-log plot. 
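
The plateau can be reproduced numerically. The sketch below assumes that the subproblem error decays as c divided by the square root of the number of subproblem iterations, and treats the trunk and recovery terms as fixed constants; all constants are hypothetical values chosen for illustration and are not taken from the experiments.

```python
import math


def total_error(sub_iters, trunk_term=0.004, recovery_term=0.002, c=1.0):
    """Sum of a fixed trunk term, a fixed recovery term, and a subproblem
    term assumed to decay like c / sqrt(sub_iters). All constants are
    hypothetical."""
    return trunk_term + recovery_term + c / math.sqrt(sub_iters)


def loglog_slope(f, x0, x1):
    """Slope of f between x0 and x1 when both axes are logarithmic."""
    return (math.log(f(x1)) - math.log(f(x0))) / (math.log(x1) - math.log(x0))


# Subproblem iteration counts doubling from 100 to 12,800.
iters = [100 * 2 ** k for k in range(8)]
slopes = [loglog_slope(total_error, a, b) for a, b in zip(iters, iters[1:])]
```

A pure c/sqrt(T) curve would have slope -0.5 everywhere. With the constant terms added, every slope is shallower than -0.5 and moves toward zero as the iteration count grows, which is the flattening described above.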

The method of PS-Opti [1] and GS1 [4] has no theoretical guarantee, but there 
remains a question of how well it does in practice. To address that, we use the GS1 
method in Leduc Hold'em. The initial step, as in CFR-D, is to generate a strategy for 
the trunk. PS-Opti and GS1 both use a static estimate of the value of a subproblem. To 
provide a best case evaluation of the method, we actually solve the entire game with a 
sequence form linear program [6], resulting in a trunk strategy which is guaranteed to 
be part of a Nash equilibrium. Using the action probabilities of the trunk strategy to 
come up with probabilities of reaching a subproblem, we then solved each subproblem 
using additional linear programs. Finally, the subproblem strategies were combined 
with the trunk strategy to get a strategy for the whole game. 
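
The step of turning trunk action probabilities into probabilities of reaching a subproblem can be sketched as follows. This is an illustrative assumption about the bookkeeping, not the implementation used in the experiments: each private hand's chance probability is multiplied by the probability that the trunk strategy plays the betting sequence with that hand, and the result is optionally normalised. The hand labels and probabilities are hypothetical.

```python
from math import prod


def subproblem_reach(prior, sequence_probs):
    """Reach probabilities at the root of a subproblem for one player.

    prior: hand -> probability that chance deals that hand.
    sequence_probs: hand -> list of the trunk strategy's action
        probabilities along the betting sequence leading to the subproblem.
    Returns the unnormalised reach per hand and the normalised distribution.
    """
    reach = {hand: prior[hand] * prod(sequence_probs[hand]) for hand in prior}
    total = sum(reach.values())
    if total > 0:
        beliefs = {hand: r / total for hand, r in reach.items()}
    else:
        beliefs = dict(reach)  # the subproblem is never reached
    return reach, beliefs


# Hypothetical single-action betting sequence over three card ranks.
prior = {"J": 1 / 3, "Q": 1 / 3, "K": 1 / 3}
sequence_probs = {"J": [0.2], "Q": [0.5], "K": [0.9]}
reach, beliefs = subproblem_reach(prior, sequence_probs)
```

Hands that the trunk strategy plays this sequence with more often carry more weight at the root of the subproblem, which is the input the subproblem solve depends on.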

The strategy generated by the unsafe method was exploitable for 0.0561 chips/hand. 
At each step, there was approximately 0 error (on the order of $10^{-11}$), so the 
exploitability all lies in the unsafe method of decomposition. Without a modification 
of the method, there is no way to drive the exploitability any lower. In contrast, the 
least exploitable CFR-D data point was already about 7.4 times less exploitable, and 
could be made arbitrarily close to 0 using additional iterations. 
As a final step to check that we implemented the unsafe method correctly, we 
verified that the strategy generated using the unsafe method achieved an average value of 
0 against a Nash equilibrium. By construction, the unsafe method is trying to generate 
a strategy profile which is a best response to a Nash equilibrium. The error arises from 
the fact that the unsafe method has no constraints to guarantee that a Nash equilibrium 
is still a best response to the new strategy profile. 

7 Conclusions 

In perfect information games, decomposing the problem into independently handled 
subproblems is a simple and effective method which is used to greatly reduce the space 
and time requirements of algorithms. The incomplete knowledge in imperfect infor- 
mation games has previously meant that decomposition into parts leads to the loss of 
any theoretical guarantees on solution quality. Despite the lack of a bound on ex- 
ploitability, decomposition has occasionally been used in an attempt to solve imperfect 
information games, due to the reduction in space requirements. We present a novel 
method of handling subproblems which retains theoretical bounds on overall solution 
quality. In contrast, we demonstrate that even in a best case scenario, there can be 
significant error in using the existing unsafe method of decomposition. 

8 Proofs 

Theorem 1 gives a proof of the upper bound on exploitability of the recovered strategy. 
The context for this section is as follows. Strategy profile $\sigma$ is an approximation of a 
Nash equilibrium for the whole game. The induced recovery game strategy profile $\sigma^F$ 
is the strategy where for all information sets in the subtrees under the F action, $\sigma^F$ 
takes the same action as $\sigma$, and at the $p_2$ information sets where F or T is chosen, $p_2$ 
always picks F. As in Section 5, we will be considering the process from the point of 
view of recovering a strategy for $p_1$. 

Lemma 1 For any p2 strategy p in the original game and p\ strategy p in the recovery 
game, if we let a = {<Ji[sg*-p\ > p)> then for any I £ 1%, (-0 = (-0^2 P (-0- 

Proof 

4(i)= E <m)*um)<m,zW-2m,z)u2{z) 

zez(i) 

= n p 2 (I)Y,*U(4I])/k*^(z[I],z)^Ml],z)Mz)k - <{I)ut pF) {I) 

Z 

□ 






Lemma 2 If $\hat\sigma$ is an $\epsilon_R$-Nash equilibrium in the recovery game, $0 \le c_I \le 1$, and 
u { ° uBR(ai)) (I) < e s + ul (I) for all I, then 

E~<5-i,BJi(5-i)>/ n ^ n T , ,n . . ~( <7 f < BR (.<?i))fn 

C/«2 "(I) <{\I\-V)£S + (-R + 2_^ C I U 2 i 1 ) 



i 



Proof $\hat\sigma$ and $\sigma$ have the following properties. 

< e R + u% < e R + u 2 ° 

v%(i)<4* uBR(ai)} (i) 

vUl) < uf - BR ^\l) < es + uf(I) = e s + vUl) 

Given this, the maximum difference between c ■ { i ^ Tl ' BR ^ cri ^ an d c ■ u 2 ° ' BR ( a i )> 
occurs when the difference of these sums is concentrated at a single I. That is, for 
some I 

$c_I = 1$ 

and for all $I' \neq I$ 

uf x > BR{Bx)) {I') =uf (/') 

uf' BR ^ )) {I') = e s + uf{I') 

$c_{I'} = 0$ 

In this case, the difference is $(|\mathcal{I}| - 1)\epsilon_S + \epsilon_R$. □ 

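Theorem 1 Let $\sigma$ be an equilibrium profile approximation, where $\epsilon_S$ is an upper bound 
on the $p_2$ counterfactual regret so that $R_2(I) \le \epsilon_S$ over all $I$ in $\mathcal{I}_2$. Let $\hat\sigma$ be the 
recovered strategy, with a bound $\epsilon_R$ on the exploitability in the recovery game. Then 
the exploitability of $\sigma$ is increased by no more than $(|\mathcal{I}| - 1)\epsilon_S + \epsilon_R$ if we use $\hat\sigma$ in the 
subproblem: 

$$u_2^{\langle \sigma_{1[SG \leftarrow \hat\sigma_1]},\, BR(\sigma_{1[SG \leftarrow \hat\sigma_1]}) \rangle} \le (|\mathcal{I}| - 1)\epsilon_S + \epsilon_R + u_2^{\langle \sigma_1,\, BR(\sigma_1) \rangle}$$ 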

Proof Let a = (o- 1[SG ^ ai] , BR(a 1[SG ^ ai] )). In this case, 

z^SG z£SG z^SG z£SG 

(1) 

Considering only the second sum, rearranging the terms and using Lemma 1, 

zeSG zeifl IeX A 
A best response must have no less utility than the $p_2$ strategy it replaces, and we can then apply Lemma 2: 

<Y J <( I )^ 1 - BR{dl)) ( I ) < M - 1)*S + en + ^Ul^^^Hl) 
I I 

Because u% (I) — v^iX) an d ^(-f ■ T) = v%(I) for all I, BR(o~f) can always pick 
action F, and we can directly use BR(a[) in the real game, with the same counter- 
factual value. 

= {\I\-l)e s +e R + Y,4{I)vt' BR{ai)) {l) 
i 

Putting this back into line (1) and noting that a best response can only increase the 
utility, we get 

a i\t\ i\ i i {°i <&[SG*-BR(ir)) ^ /i T-i ii | | (cri,BR(ai)) 

u 2 = (Ml - l)es + efi + u 2 <{\I\-l)es + eR + u% 

□ 